date()
## [1] "Mon Dec 4 19:57:23 2023"
There seem to be many new things to learn, especially GitHub. As mentioned, I expect to get better at working with GitHub during the course, as well as at doing statistics on large data sets with R.

The Exercise1.Rmd file had a nice summary of handling data in R in chapters 1 & 2. Some of it was already familiar to me, but there was still a lot to learn; chapter 2 in particular had nice suggestions and tricks in its examples for handling data frames.

Using RStudio still feels quite fine; uploading files to GitHub is still a bit unclear, and I hope I will get more comfortable with it as the course goes on. Exercise set 1 was quite extensive. Most of it was familiar, but it never hurts to refresh your memory.
GitHub repository: https://github.com/venkatharina/IODS-project

GitHub diary: https://venkatharina.github.io/IODS-project/
date()
## [1] "Mon Dec 4 19:57:23 2023"
R-script: https://github.com/venkatharina/IODS-project/blob/master/Chapter2_Data_Wrangling.R
newfile <- read.table(url("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt"), sep = ",", header = TRUE)
dim(newfile) #166 rows and 7 columns
## [1] 166 7
head(newfile)
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
colnames(newfile)
## [1] "gender" "age" "attitude" "deep" "stra" "surf" "points"
#column names:"gender","age","attitude","deep","stra","surf","points"
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
summary(newfile)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
table(newfile$gender)
##
## F M
## 110 56
ggpairs(newfile, lower = list(combo = wrap("facethist", bins = 20)))
#mean values of the variables: age 25.5, attitude 3.143, deep 3.68, stra 3.12, surf 2.78, points 22.7
#there are 110 females and 56 males
#significant correlations: surf-attitude*, points-attitude***, surf-deep*** and surf-stra*
#visually, attitude, stra and surf seem normally distributed
#choosing three explanatory variables: attitude, surf, stra
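The visual normality impression can also be backed up with a Shapiro-Wilk test; a minimal sketch on simulated data (the same call works directly on columns such as `newfile$attitude`):

```r
# Shapiro-Wilk: a small p-value is evidence against normality
set.seed(1)
x_normalish <- rnorm(166, mean = 3.1, sd = 0.7)  # shaped roughly like 'attitude'
x_skewed    <- rexp(166)                         # clearly non-normal
shapiro.test(x_normalish)$p.value                # large: consistent with normality
shapiro.test(x_skewed)$p.value                   # tiny: normality rejected
```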
model = lm(points ~ attitude+surf+stra, data=newfile)
summary(model)
##
## Call:
## lm(formula = points ~ attitude + surf + stra, data = newfile)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## surf -0.5861 0.8014 -0.731 0.46563
## stra 0.8531 0.5416 1.575 0.11716
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
par(mfrow = c(2,2))
#attitude, with p-value 1.93e-08 ***, seems highly significant in the model
#whereas surf and stra do not significantly explain points
#the model explains ~21% of the variance of points (multiple R-squared 0.207)
#if we remove the variables surf and stra we get model2
model2 = lm(points ~ attitude, data=newfile)
summary(model2)
##
## Call:
## lm(formula = points ~ attitude, data = newfile)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
#attitude explains points significantly, but accounts for only ~19% of their variance (multiple R-squared)
#plotting models
par(mfrow = c(2,2))
plot(model)
par(mfrow = c(2,2))
plot(model2)
#overall the diagnostic plots look good (the linear model assumptions hold)
#top left: residuals vs fitted; residuals scatter around zero with no clear pattern, so the errors look independent of the fitted values
#top right: QQ-plot; the residuals follow the normal reference line closely
#bottom left: scale-location; the magnitude of the residuals is similar across fitted values (homoscedasticity)
#bottom right: residuals vs leverage; no points with both a high residual and high leverage, so no single observation dominates the fit
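The quantities behind these four plots can also be extracted directly with base R helper functions; a small sketch on simulated data:

```r
set.seed(2)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)
fit <- lm(y ~ x)
r_std <- rstandard(fit)       # standardised residuals (QQ and scale-location plots)
lev   <- hatvalues(fit)       # leverage of each observation
cook  <- cooks.distance(fit)  # combined influence (residuals vs leverage plot)
which.max(cook)               # index of the single most influential observation
```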
date()
## [1] "Mon Dec 4 19:57:26 2023"
R-Script: https://github.com/venkatharina/IODS-project/blob/Data/create_alc.R
#reading table
a <- read.table(url("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv"), sep = ",", header = TRUE)
#looks the same as what I got in the data wrangling exercise
dim(a) #dimensions of the table: 370 observations of 35 variables
## [1] 370 35
str(a) #structure of table
## 'data.frame': 370 obs. of 35 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr "teacher" "other" "other" "services" ...
## $ reason : chr "course" "course" "other" "home" ...
## $ guardian : chr "mother" "father" "mother" "mother" ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ schoolsup : chr "yes" "no" "yes" "no" ...
## $ famsup : chr "no" "yes" "no" "yes" ...
## $ activities: chr "no" "no" "no" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "no" "yes" "yes" "yes" ...
## $ romantic : chr "no" "no" "no" "yes" ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ paid : chr "no" "no" "yes" "yes" ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
colnames(a) #column names of table
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
#There are 35 variables: logical, integer, double and character vectors
#The data concern student performance in secondary education (high school) combined with the students' alcohol consumption
#The attributes include student grades and demographic, social and school-related features; the data were collected using school reports and questionnaires
#To study their relation to low/high alcohol consumption
#I selected Fedu, studytime, romantic and absences
#My expectation: higher parental education and high studytime would lead to low alcohol consumption,
#while romantic relationships and absences from school would lead to high alcohol consumption
##############################################################################
c.names = c("Fedu","studytime","romantic","absences")
par(mfrow = c(2,2))
for(ii in 1:length(c.names)){
barplot(table(a[,c.names[ii]]), main = c.names[ii], col="blue")
}
#Fedu is the father's education: most fathers have at least some education, many a high one
#most students are in studytime class 2 (2-5 h/week)
#the majority are not in a romantic relationship
#the majority of absence counts are between 0-5
#alcohol consumption vs. father's education
table(a$alc_use,a$Fedu)
##
## 0 1 2 3 4
## 1 2 25 42 37 34
## 1.5 0 15 15 18 15
## 2 0 9 18 15 14
## 2.5 0 7 9 13 12
## 3 0 8 10 8 6
## 3.5 0 5 5 1 6
## 4 0 1 1 5 2
## 4.5 0 1 1 0 1
## 5 0 2 4 0 3
plot(table(a$alc_use,a$Fedu))
#there does not seem to be a very clear pattern
#overall most students drink at level 1 (very low), regardless of the father's education
#at the very high level 5, the father's education tends to be 1-2
#alcohol consumption vs. studytime
table(a$alc_use,a$studytime)
##
## 1 2 3 4
## 1 21 72 30 17
## 1.5 18 31 10 4
## 2 17 25 12 2
## 2.5 10 25 5 1
## 3 12 18 1 1
## 3.5 11 4 2 0
## 4 4 4 0 1
## 4.5 0 3 0 0
## 5 5 3 0 1
plot(table(a$alc_use,a$studytime))
#students who study little (class 1) more often drink at high levels (3.5-5)
#students who study a lot (class 4) seem to drink less overall
#the typical student is in studytime class 2 and at alcohol level 1
#alcohol consumption vs. romantic
table(a$alc_use,a$romantic)
##
## no yes
## 1 98 42
## 1.5 42 21
## 2 33 23
## 2.5 29 12
## 3 25 7
## 3.5 13 4
## 4 6 3
## 4.5 1 2
## 5 4 5
plot(table(a$alc_use,a$romantic))
#overall the majority drinks only at level 1
#students in a romantic relationship seem to drink less,
#until the very high levels (4.5-5), where they are slightly overrepresented
#alcohol consumption vs. absences
table(a$alc_use,a$absences)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 16 17 18 19 20 21 26 27 29
## 1 31 18 24 21 14 4 10 3 6 2 4 0 1 0 0 0 0 0 0 0 1 0 0 0
## 1.5 10 10 8 6 5 7 3 2 4 2 0 2 0 1 0 0 0 1 0 2 0 0 0 0
## 2 9 9 8 4 5 5 3 4 4 1 1 0 2 0 1 0 0 0 0 0 0 0 0 0
## 2.5 6 5 3 4 5 2 3 1 1 3 0 2 0 0 3 0 1 0 0 0 0 0 1 0
## 3 4 6 6 1 2 2 1 0 2 1 0 1 2 0 0 1 0 0 1 0 0 1 0 1
## 3.5 3 2 2 0 3 1 0 1 1 1 1 0 0 0 0 0 0 1 0 0 1 0 0 0
## 4 0 0 3 1 0 0 1 1 1 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0
## 4.5 0 0 0 0 0 0 0 0 1 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0
## 5 0 0 2 1 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0
##
## 44 45
## 1 0 1
## 1.5 0 0
## 2 0 0
## 2.5 1 0
## 3 0 0
## 3.5 0 0
## 4 0 0
## 4.5 0 0
## 5 0 0
plot(table(a$alc_use,a$absences))
#students who drink little (level 1) seem to have fewer absences
#students who drink a lot (levels 4-5) seem to have more absences
#the majority still drinks little, as noted before
#I predicted that high parental education and high studytime would lead to less drinking
#by eye, the father's education does not seem to have a large effect,
#but high studytime does seem to go with lower alcohol consumption
#I also predicted that romantic relationships and many absences lead to high alcohol consumption
#romantic relationships seem to work the opposite way, towards less alcohol,
#while low absences go with low and high absences with high alcohol consumption
##############################################################################
#regressing high_use (high alcohol consumption)
#on Fedu, studytime, absences and romantic status as predictors
summary(a$high_use)#high_use FALSE 259, TRUE 111
## Mode FALSE TRUE
## logical 259 111
table(as.numeric(a$high_use),a$high_use) #FALSE 0, TRUE 1
##
## FALSE TRUE
## 0 259 0
## 1 0 111
#high_use is the outcome variable --> modeling the TRUE answer
#NB: without family = "binomial", glm() fits a Gaussian (linear probability) model rather than a logistic regression; the output below reflects that
m = glm(high_use ~ Fedu + studytime + absences + romantic, data = a)
summary(m)
##
## Call:
## glm(formula = high_use ~ Fedu + studytime + absences + romantic,
## data = a)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4473666 0.0848465 5.273 2.31e-07 ***
## Fedu -0.0001055 0.0211342 -0.005 0.996018
## studytime -0.1023084 0.0272659 -3.752 0.000204 ***
## absences 0.0171176 0.0042214 4.055 6.13e-05 ***
## romanticyes -0.0474788 0.0493497 -0.962 0.336641
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1941708)
##
## Null deviance: 77.700 on 369 degrees of freedom
## Residual deviance: 70.872 on 365 degrees of freedom
## AIC: 450.54
##
## Number of Fisher Scoring iterations: 2
#Fedu and romanticyes do not seem significant in this model
##Fedu would decrease alcohol consumption, but not significantly
##romanticyes would decrease alcohol consumption, but not significantly
#studytime significantly decreases the probability of high alcohol consumption
#absences significantly increase the probability of high alcohol consumption
#I predicted that parental education and studytime would decrease alcohol consumption
#and that romantic relationships and absences would increase it
#The father's education and studytime do predict decreased consumption, but the father's education not significantly
#Romantic relationships do not increase consumption; they decrease it, though not significantly
#Absences do increase high alcohol consumption, as I predicted in the beginning
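For an actual logistic regression the `family = "binomial"` argument is needed; a sketch on simulated data (the variable names `study` and `abs_` are made up for illustration), where `exp(coef(...))` turns the log-odds coefficients into odds ratios:

```r
set.seed(3)
n     <- 370
study <- sample(1:4, n, replace = TRUE)           # simulated studytime classes
abs_  <- rpois(n, 4)                              # simulated absence counts
p     <- plogis(-0.5 - 0.4 * study + 0.15 * abs_) # true high-use probability
high  <- rbinom(n, 1, p)
m_log <- glm(high ~ study + abs_, family = "binomial")
exp(coef(m_log))  # odds ratios: < 1 protective, > 1 risk-increasing
```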
##############################################################################
#creating a new regression model with only the significant variables
#high_use ~ studytime + absences
#(again fitted with the default Gaussian family, matching the output below)
mm = glm(high_use ~ studytime + absences, data = a)
summary(mm)
##
## Call:
## glm(formula = high_use ~ studytime + absences, data = a)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.43618 0.06491 6.720 6.95e-11 ***
## studytime -0.10350 0.02720 -3.805 0.000166 ***
## absences 0.01669 0.00419 3.984 8.19e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1936026)
##
## Null deviance: 77.700 on 369 degrees of freedom
## Residual deviance: 71.052 on 367 degrees of freedom
## AIC: 447.48
##
## Number of Fisher Scoring iterations: 2
#creating new table with probabilities and predictions of model
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
alc <- mutate(a, probability = predict(mm, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)
#true values of high alcohol use
actual <- summary(alc$high_use)
#predicted values of high alcohol use with studytime and absences
prediction <- summary(alc$prediction)
act_pred <- cbind(actual,prediction)
act_pred
## actual prediction
## Mode "logical" "logical"
## FALSE "259" "349"
## TRUE "111" "21"
#actual: 259 students have low alcohol consumption and 111 have high
#model: 349 are predicted low and only 21 high
#plotting the actual alcohol consumption rates and probability rates:
plot(alc$alc_use)
plot(alc$probability) #the predicted probabilities mostly stay well below the 0.5 threshold
#it seems the model underestimates high alcohol consumption
#2x2 crosstabular of actual values and predictions
table(alc$high_use,alc$prediction)
##
## FALSE TRUE
## FALSE 252 7
## TRUE 97 14
#the cross-tabulation shows many actual TRUEs predicted as FALSE (97 of 111)
#so the model does not fit very well
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
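As a quick sanity check of the loss function: a prediction counts as wrong exactly when the probability falls on the wrong side of 0.5.

```r
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# class 1 with prob 0.9 -> right, class 0 with prob 0.2 -> right,
# class 1 with prob 0.4 -> wrong: error rate 1/3
loss_func(class = c(1, 0, 1), prob = c(0.9, 0.2, 0.4))
```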
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = mm, K = nrow(alc))
# average number of wrong predictions in the cross validation
cv$delta[1] #model gives wrong predictions: 0.2864865 = ~29%
## [1] 0.2864865
#guessing predictions with studytime=4, absences=0:
predict(mm, type = "response",
newdata = data.frame(studytime = 4, absences = 0))
## 1
## 0.02218847
#model prediction: 0.02218847 = ~2% probability of high use
#when studytime=4, absences=0; the observed alcohol consumption values are 1.0, 2.0, 1.0 and 1.5 (scale: 1-5)
(((1+2+1+1.5)/4)/5) #average and normalised value
## [1] 0.275
#real: 0.275 = ~28% of the scale maximum
#the model gets this case right (real: FALSE, prediction: FALSE)
#guessing predictions with studytime=1, absences=27:
predict(mm, type = "response",
newdata = data.frame(studytime = 1, absences = 27))
## 1
## 0.7833464
#model prediction:0.7833464 = 78%
#when studytime=1, absences=27; alcohol consumption value 2.5 (scale:1-5)
(2.5/5) #average and normalised value
## [1] 0.5
#real: 0.5 = 50%
#the model is okay-ish with these settings (real: TRUE, prediction: TRUE)
#the model works to some extent, but there is a lot of error
#it would need more predictors to do better
##############################################################################
#10-fold cross-validation
cv10 <- cv.glm(data = alc, cost = loss_func, glmfit = mm, K = 10)
# average number of wrong predictions in the cross validation
cv10$delta[1] #model gives wrong predictions: 0.2837838 = ~28%
## [1] 0.2837838
#The model does not get better with 10-fold cross-validation
#This model has a higher prediction error than the one in Exercise 3 (26%)
#With different/more predictors a better model could perhaps be found
##############################################################################
#making a new logistic regression model with lots of predictors:
m2 = glm(high_use ~ age + traveltime + studytime + famrel + freetime + goout + health + failures + absences,
         data = a, family = "binomial")
summary(m2)
##
## Call:
## glm(formula = high_use ~ age + traveltime + studytime + famrel +
## freetime + goout + health + failures + absences, family = "binomial",
## data = a)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.85188 2.06322 -1.867 0.06191 .
## age 0.06714 0.11775 0.570 0.56855
## traveltime 0.37155 0.18580 2.000 0.04553 *
## studytime -0.46960 0.17578 -2.671 0.00755 **
## famrel -0.42247 0.14607 -2.892 0.00383 **
## freetime 0.18999 0.14557 1.305 0.19184
## goout 0.70070 0.12886 5.438 5.4e-08 ***
## health 0.16616 0.09720 1.709 0.08736 .
## failures 0.22758 0.23451 0.970 0.33181
## absences 0.06848 0.02202 3.110 0.00187 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 358.70 on 360 degrees of freedom
## AIC: 378.7
##
## Number of Fisher Scoring iterations: 4
#counting average number of wrong predictions
cv2 <- cv.glm(data = a, cost = loss_func, glmfit = m2, K = nrow(a))
cv2$delta[1] #0.2459459 average number of wrong predictions in the cross validation
## [1] 0.2459459
#adding predictions and probability to the table:
alc2 <- mutate(a, probability = predict(m2, type = "response"))
alc2 <- mutate(alc2, prediction = probability > 0.5)
#seeing actual values and predicted ones in one table
prediction2 <- summary(alc2$prediction)
act_pred2 <- cbind(actual,prediction2)
act_pred2
## actual prediction2
## Mode "logical" "logical"
## FALSE "259" "294"
## TRUE "111" "76"
#model2 (m2) is getting closer to the real data than model mm
#2x2 crosstabular of actual values and predictions:
table(alc2$high_use,alc2$prediction)
##
## FALSE TRUE
## FALSE 232 27
## TRUE 62 49
#There are more false positives (27 vs 7), but far more true high users are now caught (49 vs 14)
#Overall the model seems better
#adjusting the model according to significant predictors:
m3 = glm(high_use ~ traveltime + studytime + famrel + goout + health + absences,
         data = a, family = "binomial")
summary(m3)
##
## Call:
## glm(formula = high_use ~ traveltime + studytime + famrel + goout +
## health + absences, family = "binomial", data = a)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.35767 0.84403 -2.793 0.00522 **
## traveltime 0.38673 0.18207 2.124 0.03366 *
## studytime -0.52063 0.17144 -3.037 0.00239 **
## famrel -0.40296 0.14244 -2.829 0.00467 **
## goout 0.77461 0.12334 6.280 3.38e-10 ***
## health 0.17958 0.09532 1.884 0.05957 .
## absences 0.06861 0.02191 3.132 0.00174 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 362.02 on 363 degrees of freedom
## AIC: 376.02
##
## Number of Fisher Scoring iterations: 4
#counting average number of wrong predictions:
cv3 <- cv.glm(data = a, cost = loss_func, glmfit = m3, K = nrow(a))
cv33 <- cv3$delta[1] #0.2378378 average number of wrong predictions in the cross validation
#adding predictions and probability to the table:
alc3 <- mutate(a, probability = predict(m3, type = "response"))
alc3 <- mutate(alc3, prediction = probability > 0.5)
#seeing actual values and predicted ones in one table:
prediction3 <- summary(alc3$prediction)
act_pred3 <- cbind(actual,prediction3)
act_pred3
## actual prediction3
## Mode "logical" "logical"
## FALSE "259" "293"
## TRUE "111" "77"
#model3 (m3) is getting even closer to the real data (predictability better)
#2x2 crosstabular of actual values and predictions:
table(alc3$high_use,alc3$prediction)
##
## FALSE TRUE
## FALSE 232 27
## TRUE 61 50
#Compared to m2 the false positives stay at 27, while one more true high user is caught (50 vs 49)
#Overall this model seems slightly better still
#In both models, adding more (significant) predictors improves prediction: the cross-validation error shrinks and the cross-tabulations fit the data better
#I did not know how to make a graph to show both training and testing errors by the number of predictors in the model
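One way to draw such a graph of training and testing error against the number of predictors (a sketch on simulated data; the column names `x1`..`x5` are made up, and the same loop would work on `alc` with nested predictor sets):

```r
set.seed(4)
n <- 200
dat <- as.data.frame(matrix(rnorm(n * 5), n, 5))
names(dat) <- paste0("x", 1:5)
dat$y <- rbinom(n, 1, plogis(dat$x1 - dat$x2))  # only x1, x2 truly matter
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)
folds <- sample(rep(1:10, length.out = n))      # 10-fold assignment
errs <- t(sapply(1:5, function(k) {
  f <- reformulate(paste0("x", 1:k), response = "y")
  train_err <- loss_func(dat$y, fitted(glm(f, data = dat, family = "binomial")))
  cv_err <- mean(sapply(1:10, function(i) {     # hand-rolled 10-fold CV
    fit <- glm(f, data = dat[folds != i, ], family = "binomial")
    loss_func(dat$y[folds == i],
              predict(fit, dat[folds == i, ], type = "response"))
  }))
  c(train = train_err, cv = cv_err)
}))
matplot(1:5, errs, type = "b", pch = 1:2,
        xlab = "number of predictors", ylab = "prediction error")
legend("topright", legend = c("training", "10-fold CV"), pch = 1:2)
```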
date()
## [1] "Mon Dec 4 19:57:28 2023"
R-Script: https://github.com/venkatharina/IODS-project/blob/Data/create_human.R
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
library(ggplot2)
#loading the data
data("Boston")
#exploring the dataset
dim(Boston) #506 rows and 14 variables
## [1] 506 14
str(Boston) #numeric and integer vectors
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
#The data come from a study of methodological problems in using housing market data to measure the willingness to pay for clean air
#Using a hedonic housing price model and data for the Boston metropolitan area, the study generates quantitative estimates of the willingness to pay for air quality improvements
#Marginal air pollution damages (as revealed in the housing market) are found to increase with the level of air pollution and with household income
#graphical overview and summary of the data:
plot(Boston)
pairs(Boston)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
cor_matrix <- cor(Boston)
corrplot(cor_matrix, method="circle")
#lots of variables; some barely correlate, while others correlate negatively/positively with each other
#scaling data:
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
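`scale()` centres each column to mean 0 and divides it by its standard deviation; a quick check on a toy matrix:

```r
m  <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)
ms <- scale(m)
colMeans(ms)     # both columns ~0 after centring
apply(ms, 2, sd) # both columns 1 after scaling
```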
#each variable was centred to mean 0 and scaled to unit variance
#`crim` (per capita crime rate by town)
#cutting the variable to put low, middle and high crime rates into their own categories
boston_scaled$crim <- as.numeric(boston_scaled$crim)
#using the quantiles as the break points (bins)
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
#creating categorical variable
crime <- cut(boston_scaled$crim, breaks =bins, include.lowest = TRUE)
#removing original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
#adding the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
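Using the quantiles as break points gives (roughly) equal-sized categories; a sketch on a simple simulated vector:

```r
set.seed(5)
v    <- rnorm(100)
bins <- quantile(v)  # min, quartiles, max
cats <- cut(v, breaks = bins, include.lowest = TRUE)  # include.lowest keeps the minimum
table(cats)  # four classes of 25 observations each
```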
#creating training (80%) and test (20%) sets
n <- nrow(Boston)
ind <- sample(n, size = n * 0.8) #note: without set.seed() this split is not reproducible
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
#doing linear discriminant analysis
lda.fit = lda(crime ~ ., data=train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## 0.2475248 0.2400990 0.2549505 0.2574257
##
## Group means:
## zn indus chas nox rm
## [-0.419,-0.411] 0.8676773 -0.8743397 -0.19358706 -0.8798680 0.4309682
## (-0.411,-0.39] -0.1291940 -0.2638612 0.01179157 -0.5462142 -0.1517527
## (-0.39,0.00739] -0.3856672 0.2864380 0.30103504 0.4469401 0.1130815
## (0.00739,9.92] -0.4872402 1.0170690 -0.04518867 1.0863340 -0.4266874
## age dis rad tax ptratio
## [-0.419,-0.411] -0.8831228 0.8551820 -0.6981998 -0.7160272 -0.45198994
## (-0.411,-0.39] -0.2385570 0.3213141 -0.5603720 -0.5091726 -0.03755605
## (-0.39,0.00739] 0.4094339 -0.4081168 -0.4243629 -0.2892640 -0.34898504
## (0.00739,9.92] 0.8074894 -0.8501113 1.6386213 1.5144083 0.78135074
## black lstat medv
## [-0.419,-0.411] 0.37181027 -0.74283960 0.50093960
## (-0.411,-0.39] 0.31180364 -0.09135079 -0.01387955
## (-0.39,0.00739] 0.02797664 0.02348579 0.22761592
## (0.00739,9.92] -0.74922376 0.85231810 -0.68396515
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.07573442 0.642582577 -0.8777281
## indus 0.06467029 -0.281799252 0.2842424
## chas -0.09415488 -0.128864688 0.0796810
## nox 0.36913553 -0.785514540 -1.3149356
## rm -0.08922996 -0.094113619 -0.2125112
## age 0.20688024 -0.264001262 0.2103381
## dis -0.01899534 -0.256018753 0.2266716
## rad 3.39016563 0.923254217 0.1736943
## tax 0.04053785 0.009158699 0.1330706
## ptratio 0.12804937 -0.032002684 -0.1612895
## black -0.10654682 0.076142431 0.1296544
## lstat 0.25488188 -0.247476685 0.2745129
## medv 0.22610267 -0.448557564 -0.2209189
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9542 0.0356 0.0102
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
#plotting the LDA results (biplot)
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)
#saving the crime categories from the test set
correct_classes <- test$crime
#then removing the categorical crime variable from the test dataset
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## [-0.419,-0.411] 19 7 1 0
## (-0.411,-0.39] 6 19 4 0
## (-0.39,0.00739] 0 14 7 2
## (0.00739,9.92] 0 0 0 23
#only the highest crime class (0.00739,9.92] is predicted perfectly
#the two lowest classes are mostly right, but the third class is mostly misclassified into the second, so there is room for improvement
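The same fit / predict / cross-tabulate pattern on a built-in dataset, with the share of correct predictions as a single summary number (a sketch using `iris`):

```r
library(MASS)
set.seed(6)
ind  <- sample(nrow(iris), size = 0.8 * nrow(iris))  # 80/20 split
fit  <- lda(Species ~ ., data = iris[ind, ])
pred <- predict(fit, newdata = iris[-ind, ])
table(correct = iris$Species[-ind], predicted = pred$class)
mean(pred$class == iris$Species[-ind])  # proportion classified correctly
```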
#reloading the data
data("Boston")
#scaling data:
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled$dis)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.2658 -0.8049 -0.2790 0.0000 0.6617 3.9566
#calculating distances between the observations
distance <- dist(boston_scaled)
summary(distance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
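`dist()` returns all pairwise distances (Euclidean by default) as a compact lower-triangle object; a tiny check with known distances:

```r
pts <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 8))
d   <- dist(pts)  # Euclidean by default
as.matrix(d)      # a-b is 5, a-c is 10, b-c is 5 (3-4-5 triangles)
```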
#trying to identify right cluster number
set.seed(140)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss}) #note: this uses the unscaled Boston; for consistency with the clustering below, boston_scaled would arguably be the better choice
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
#right cluster number seems to be 6 for k-means clustering
km <- kmeans(boston_scaled, centers = 6)
# plot the dataset with clusters
pairs(boston_scaled, col = km$cluster)
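k-means depends on its random initialisation, so results can vary between runs; the `nstart` argument reruns the algorithm and keeps the best solution (a sketch on the scaled `iris` measurements):

```r
set.seed(7)
xs   <- scale(iris[, 1:4])
km1  <- kmeans(xs, centers = 3, nstart = 1)   # single random start
km25 <- kmeans(xs, centers = 3, nstart = 25)  # best of 25 random starts
c(single = km1$tot.withinss, best_of_25 = km25$tot.withinss)
```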
############ BONUS ############
#reloading data:
data("Boston")
#scaling data:
boston_scaled <- as.data.frame(scale(Boston))
#k-means clustering with 8 centers
km3 <- kmeans(boston_scaled, centers = 8)
boston_scaled$kmcluster <- as.numeric(km3$cluster)
#fitting an LDA model with the k-means cluster as the target class
lda.fit2 = lda(kmcluster~., data=boston_scaled)
lda.fit2
## Call:
## lda(kmcluster ~ ., data = boston_scaled)
##
## Prior probabilities of groups:
## 1 2 3 4 5 6 7
## 0.21343874 0.04150198 0.21541502 0.06521739 0.10869565 0.15415020 0.10276680
## 8
## 0.09881423
##
## Group means:
## crim zn indus chas nox rm
## 1 -0.3768106 -0.425505051 -0.3701112 0.09221725 -0.2286148 -0.35000772
## 2 3.2082952 -0.487240187 1.0149946 -0.27232907 1.0998477 -1.45566793
## 3 -0.4060502 -0.005364112 -0.6583486 -0.27232907 -0.8405622 -0.04480167
## 4 1.1193656 -0.487240187 1.0149946 -0.27232907 0.9950574 -0.19450805
## 5 -0.4146327 2.524683754 -1.2040537 -0.20074543 -1.2034679 0.73835915
## 6 0.4620310 -0.487240187 1.0149946 0.13147608 1.0021383 -0.15800782
## 7 -0.2727474 -0.487240187 1.5613334 0.10623826 1.1631724 -0.63230595
## 8 -0.3681800 -0.053323566 -0.7442735 0.59383298 -0.2416605 1.68533550
## age dis rad tax ptratio black
## 1 0.3376923 -0.1286995 -0.5799077 -0.53244743 0.22283672 0.2729966
## 2 1.0064301 -1.0710604 1.6596029 1.52941294 0.80577843 -0.1691195
## 3 -1.0244892 0.8700429 -0.5720054 -0.70802741 -0.07989337 0.3722489
## 4 0.7406815 -0.8461740 1.6596029 1.52941294 0.80577843 -3.2850509
## 5 -1.4205771 1.6272606 -0.6832698 -0.53843478 -0.89403340 0.3540831
## 6 0.6920432 -0.7470446 1.6596029 1.52941294 0.80577843 0.1640227
## 7 0.9324774 -0.9083143 -0.6130366 0.03111252 -0.35520300 -0.1409865
## 8 0.1056916 -0.2903328 -0.4926242 -0.78414273 -1.08156698 0.3392477
## lstat medv
## 1 0.1641254 -0.25817616
## 2 2.0102765 -1.39479170
## 3 -0.5625097 0.09778119
## 4 1.1195132 -1.03254705
## 5 -0.9678176 0.83523455
## 6 0.3945960 -0.31539874
## 7 0.6799267 -0.51668841
## 8 -0.9695286 1.72241105
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3 LD4 LD5
## crim 0.19364384 0.126100236 -0.20191652 0.376580103 -0.838531764
## zn -0.05343615 1.094813039 1.50799254 1.133131460 0.197678623
## indus 0.55399704 -1.280702654 1.13241037 1.197258173 0.124603383
## chas -0.03302284 -0.024772810 -0.14210875 -0.080434989 0.122406528
## nox 0.05435103 -0.424948658 0.26397352 0.771840326 0.285880375
## rm -0.04813200 0.181941956 0.09971698 -0.277384152 0.689446855
## age 0.14103637 -0.651476207 -0.09019746 -0.008315478 0.502754473
## dis -0.33219936 0.286780273 0.03476554 0.281273725 -0.252480041
## rad 5.93891512 1.865585385 -1.41594980 -0.744695993 0.427251550
## tax -0.05791010 0.519204288 0.49438996 0.580198607 0.087932891
## ptratio 0.21221433 -0.011384841 0.01954111 -0.048612579 -0.116143664
## black -0.19991944 0.269843607 -1.52326403 1.543950260 -0.002263929
## lstat 0.07512642 -0.006494829 0.08239517 -0.043407190 -0.209298121
## medv -0.08095390 0.223086304 -0.03618324 0.137450228 0.487212102
## LD6 LD7
## crim 1.055793974 -0.76410951
## zn 0.472879553 0.75684606
## indus -0.908040391 -1.05671771
## chas 0.139637868 0.18984158
## nox -0.186010320 -0.32881295
## rm 0.077931268 -0.31058189
## age 0.573790467 0.91403286
## dis -0.721069945 -0.66003474
## rad -0.851480550 -0.28819620
## tax 0.245047274 0.77133610
## ptratio 0.008931086 0.50612180
## black -0.001693988 0.02201946
## lstat 0.741261471 -0.23456713
## medv 0.594939439 -0.54523880
##
## Proportion of trace:
## LD1 LD2 LD3 LD4 LD5 LD6 LD7
## 0.7632 0.1051 0.0506 0.0403 0.0205 0.0141 0.0062
#plotting
plot(lda.fit2, dimen = 2)
lda.arrows(lda.fit2, myscale = 1)
############ SUPERBONUS ############
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
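The matrix product above projects the 404 x 13 predictor matrix onto the 13 x 3 scaling matrix, so each observation gets coordinates on the three linear discriminants. The dimension rule in miniature, with made-up matrices:

```r
A <- matrix(1:6, nrow = 2)    # 2 x 3
B <- matrix(1:12, nrow = 3)   # 3 x 4
dim(A %*% B)                  # 2 4: the inner dimensions must match and cancel out
```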
#Next, install and access the plotly package. Create a 3D plot (cool!) of the columns of the matrix product using the code below.
#install.packages("plotly")
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
#colouring the points by the crime classes of the train set (color takes the variable itself, not a quoted string)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = train$crime, colors = c('red', 'green', 'blue', 'violet'))
date()
## [1] "Mon Dec 4 19:57:32 2023"
R-Script: https://github.com/venkatharina/IODS-project/blob/Data/create_human.R
library(dplyr)
library(readr)
library(corrplot)
library(tibble)
library(GGally)
#reading data in
human <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.csv")
## Rows: 155 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (8): Edu2.FM, Labo.FM, Life.Exp, Edu.Exp, GNI, Mat.Mor, Ado.Birth, Parli.F
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#moving the country names to rownames
human_ <- column_to_rownames(human, "Country")
#visualisation of data:
ggpairs(human_, progress = FALSE)
#summary of data:
summary(human_)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
#correlation matrix and its visual representation:
cor(human_)
## Edu2.FM Labo.FM Life.Exp Edu.Exp GNI
## Edu2.FM 1.000000000 0.009564039 0.5760299 0.59325156 0.43030485
## Labo.FM 0.009564039 1.000000000 -0.1400125 0.04732183 -0.02173971
## Life.Exp 0.576029853 -0.140012504 1.0000000 0.78943917 0.62666411
## Edu.Exp 0.593251562 0.047321827 0.7894392 1.00000000 0.62433940
## GNI 0.430304846 -0.021739705 0.6266641 0.62433940 1.00000000
## Mat.Mor -0.660931770 0.240461075 -0.8571684 -0.73570257 -0.49516234
## Ado.Birth -0.529418415 0.120158862 -0.7291774 -0.70356489 -0.55656208
## Parli.F 0.078635285 0.250232608 0.1700863 0.20608156 0.08920818
## Mat.Mor Ado.Birth Parli.F
## Edu2.FM -0.6609318 -0.5294184 0.07863528
## Labo.FM 0.2404611 0.1201589 0.25023261
## Life.Exp -0.8571684 -0.7291774 0.17008631
## Edu.Exp -0.7357026 -0.7035649 0.20608156
## GNI -0.4951623 -0.5565621 0.08920818
## Mat.Mor 1.0000000 0.7586615 -0.08944000
## Ado.Birth 0.7586615 1.0000000 -0.07087810
## Parli.F -0.0894400 -0.0708781 1.00000000
corrplot(cor(human_))
#description of data:
#Parli.F and Labo.FM seem to have the weakest correlations with the other variables
#Edu2.FM correlates positively with Life.Exp, Edu.Exp and GNI and
#negatively with Mat.Mor and Ado.Birth
#Life.Exp correlates positively with Edu2.FM, Edu.Exp and GNI and
#negatively with Mat.Mor and Ado.Birth
#Overall, more female education is associated with lower adolescent birth rates and
#lower maternal mortality
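`cor()` computes Pearson correlations, i.e. the covariance of two variables divided by the product of their standard deviations. A hand computation on made-up vectors:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
# Pearson correlation by hand: covariance / (sd_x * sd_y)
manual <- sum((x - mean(x)) * (y - mean(y))) / ((length(x) - 1) * sd(x) * sd(y))
all.equal(manual, cor(x, y))   # TRUE
```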
#Principal component analysis (PCA) with biplot
pca_human <- prcomp(human_)
pca_human
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -5.607472e-06 0.0006713951 -3.412027e-05 -2.736326e-04 -0.0022935252
## Labo.FM 2.331945e-07 -0.0002819357 5.302884e-04 -4.692578e-03 0.0022190154
## Life.Exp -2.815823e-04 0.0283150248 1.294971e-02 -6.752684e-02 0.9865644425
## Edu.Exp -9.562910e-05 0.0075529759 1.427664e-02 -3.313505e-02 0.1431180282
## GNI -9.999832e-01 -0.0057723054 -5.156742e-04 4.932889e-05 -0.0001135863
## Mat.Mor 5.655734e-03 -0.9916320120 1.260302e-01 -6.100534e-03 0.0266373214
## Ado.Birth 1.233961e-03 -0.1255502723 -9.918113e-01 5.301595e-03 0.0188618600
## Parli.F -5.526460e-05 0.0032317269 -7.398331e-03 -9.971232e-01 -0.0716401914
## PC6 PC7 PC8
## Edu2.FM 2.180183e-02 6.998623e-01 7.139410e-01
## Labo.FM 3.264423e-02 7.132267e-01 -7.001533e-01
## Life.Exp -1.453515e-01 5.380452e-03 2.281723e-03
## Edu.Exp 9.882477e-01 -3.826887e-02 7.776451e-03
## GNI -2.711698e-05 -8.075191e-07 -1.176762e-06
## Mat.Mor 1.695203e-03 1.355518e-04 8.371934e-04
## Ado.Birth 1.273198e-02 -8.641234e-05 -1.707885e-04
## Parli.F -2.309896e-02 -2.642548e-03 2.680113e-03
biplot(pca_human, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
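`prcomp()` finds the directions of maximal variance: the squared standard deviations it reports are the eigenvalues of the data's covariance matrix. That is why unscaled GNI (standard deviation around 1.85e4 above, so by far the largest variance) swallows PC1 here. A toy check of the equivalence:

```r
set.seed(42)
X <- matrix(rnorm(100), ncol = 4)
p <- prcomp(X)                  # centers by default, no scaling
ev <- eigen(cov(X))$values      # eigenvalues of the covariance matrix
all.equal(p$sdev^2, ev)         # TRUE
```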
#Principal component analysis (PCA) with biplot with standardized data
human_std <- scale(human_)
pca_human2 <- prcomp(human_std)
pca_human2
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -0.35664370 0.03796058 -0.24223089 0.62678110 0.5983585
## Labo.FM 0.05457785 0.72432726 -0.58428770 0.06199424 -0.2625067
## Life.Exp -0.44372240 -0.02530473 0.10991305 -0.05834819 -0.1628935
## Edu.Exp -0.42766720 0.13940571 -0.07340270 -0.07020294 -0.1659678
## GNI -0.35048295 0.05060876 -0.20168779 -0.72727675 0.4950306
## Mat.Mor 0.43697098 0.14508727 -0.12522539 -0.25170614 0.1800657
## Ado.Birth 0.41126010 0.07708468 0.01968243 0.04986763 0.4672068
## Parli.F -0.08438558 0.65136866 0.72506309 0.01396293 0.1523699
## PC6 PC7 PC8
## Edu2.FM -0.17713316 0.05773644 0.16459453
## Labo.FM 0.03500707 -0.22729927 -0.07304568
## Life.Exp 0.42242796 -0.43406432 0.62737008
## Edu.Exp 0.38606919 0.77962966 -0.05415984
## GNI -0.11120305 -0.13711838 -0.16961173
## Mat.Mor -0.17370039 0.35380306 0.72193946
## Ado.Birth 0.76056557 -0.06897064 -0.14335186
## Parli.F -0.13749772 0.00568387 -0.02306476
biplot(pca_human2, choices = 1:2)
#the standardized biplot is much more readable than the non-standardized one
#non-standardized:
#GNI dominates the first component, and it seems higher in countries like Singapore and Norway
#for the rest of the variables it is hard to tell anything
#standardized:
#looking outward from the center of the graph:
#upwards we see a higher share of females in parliament (Parli.F) and in the labour force (Labo.FM) (countries like Bolivia, Tanzania)
#on the left we see higher expected education, GNI, female education ratio and life expectancy (countries like Singapore, Ireland, Canada)
#on the right we see higher maternal mortality and adolescent birth rates (countries like Sierra Leone, Gambia, Burkina Faso)
#personal opinion:
#in countries where females are more educated, GNI and life expectancy seem higher, and adolescent birth rates and maternal mortality lower
#males and females also seem to have a more equal share of the labour force and of parliament seats
#many Western countries seem to belong to this group
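The share of total variance each component explains can be computed from the standard deviations printed by `prcomp()` above; with 8 standardized variables the total variance is 8. Using the sdev values reported for pca_human2:

```r
# standard deviations as printed by prcomp() for the standardized data above
sdev <- c(2.0708380, 1.1397204, 0.8750485, 0.7788630,
          0.6619563, 0.5363061, 0.4589994, 0.3222406)
prop <- sdev^2 / sum(sdev^2)
round(100 * prop[1:2], 1)   # PC1 and PC2 explain roughly 53.6% and 16.2%
```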
#Tea dataset:
#300 individuals were asked how they drink tea (18 questions) and about their perception of the product (12 questions).
#In addition, some personal details were collected (4 questions).
#loading data:
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
#visualisation of data:
view(tea)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea) #300 rows and 36 variables
## [1] 300 36
#selecting certain columns:
library("FactoMineR")
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, all_of(keep_columns))
#plotting certain columns:
library(ggplot2)
library(tidyr)
gather(tea_time) %>%
ggplot(aes(value)) + facet_wrap("key", scales="free") + geom_bar() + theme(axis.text.x=element_text(angle=45, hjust=1))
## Warning: attributes are not identical across measure variables; they will be
## dropped
#MCA on the selected columns:
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
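In MCA the eigenvalues (the "Variance" row above) sum to the total inertia, which for an indicator-matrix MCA is K/J - 1, where J is the number of variables (6 here) and K the total number of categories (3+4+3+2+3+2 = 17 here). Recomputing the percentages from the rounded eigenvalues above (small rounding differences are expected):

```r
# eigenvalues as printed by summary(mca) above (rounded to 3 decimals)
ev <- c(0.279, 0.261, 0.219, 0.189, 0.177, 0.156,
        0.144, 0.141, 0.117, 0.087, 0.062)
round(100 * ev / sum(ev), 2)   # % of variance per dimension
sum(ev)                        # close to K/J - 1 = 17/6 - 1, about 1.833
```

The 11 dimensions also match K - J = 17 - 6.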
print(mca)
## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 300 individuals, described by 6 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. of the categories"
## 4 "$var$cos2" "cos2 for the categories"
## 5 "$var$contrib" "contributions of the categories"
## 6 "$var$v.test" "v-test for the categories"
## 7 "$var$eta2" "coord. of variables"
## 8 "$ind" "results for the individuals"
## 9 "$ind$coord" "coord. for the individuals"
## 10 "$ind$cos2" "cos2 for the individuals"
## 11 "$ind$contrib" "contributions of the individuals"
## 12 "$call" "intermediate results"
## 13 "$call$marge.col" "weights of columns"
## 14 "$call$marge.li" "weights of rows"
#plotting the MCA results
plot(mca, invisible=c("ind"), graph.type = "classic")
#from the how variable: "tea bag" seems to be the most popular (near the center)
#from the How variable: drinking tea "alone" (with nothing added) seems to be the most popular, followed by milk and lemon
#from the lunch variable: "Not.lunch" seems to be the most popular
#from the sugar variable: "sugar" and "No.sugar" seem to be pretty even
#from the Tea variable: "Earl Grey" and "black" seem more popular than "green"
#from the where variable: "chain store" seems to be the most popular
#to summarize: people seem to prefer drinking tea outside lunch time,
#plain, either Earl Grey or black, with or without sugar,
#brewed from a tea bag bought from a chain store
(more chapters to be added similarly as we proceed with the course!)